
Appendix A

Neural Information Processing Systems

Q: For what purpose was the dataset created?
Q: Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g.,
Q: Who funded the creation of the dataset?
Q: What do the instances that comprise the dataset represent (e.g., documents, photos, people,
Q: How many instances are there in total (of each type, if appropriate)?

As shown in Table 1, the dataset statistics are as follows. Grounding Task: 111,770 samples for training, 21,616 samples for testing. For grounding, we use only one annotation per image.



Metric-Fair Prompting: Treating Similar Samples Similarly

Wang, Jing, Shen, Jie, Niu, Xing, Zhang, Tong, Weiss, Jeremy

arXiv.org Artificial Intelligence

We introduce \emph{Metric-Fair Prompting}, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In multiple-choice medical question answering, each \((\text{question}, \text{option})\) pair is treated as a binary instance with label $+1$ (correct) or $-1$ (incorrect). To promote \emph{individual fairness} -- treating similar instances similarly -- we compute question similarity using NLP embeddings and solve items in \emph{joint pairs of similar questions} rather than in isolation. The prompt enforces a global decision protocol: extract decisive clinical features, map each \((\text{question}, \text{option})\) pair to a score $f(x)$ that acts as a confidence, and impose a Lipschitz-style constraint so that similar inputs receive similar scores and, hence, consistent outputs. Evaluated on the MedQA (US) benchmark, Metric-Fair Prompting improves over standard single-item prompting, demonstrating that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.
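The pairing and consistency steps described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`pair_similar`, `lipschitz_consistent`), the greedy nearest-neighbour pairing, and the Lipschitz constant `L` are all assumptions for exposition, and the embedding vectors would in practice come from an NLP embedding model.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two embedding vectors (1 - cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def pair_similar(embeddings):
    """Greedily pair each question with its nearest unpaired neighbour,
    so that similar questions are solved jointly rather than in isolation."""
    unpaired = set(range(len(embeddings)))
    pairs = []
    while len(unpaired) > 1:
        i = min(unpaired)
        unpaired.remove(i)
        j = min(unpaired,
                key=lambda k: cosine_distance(embeddings[i], embeddings[k]))
        unpaired.remove(j)
        pairs.append((i, j))
    return pairs

def lipschitz_consistent(score_i, score_j, dist, L=2.0):
    """Lipschitz-style check: similar inputs (small dist) must receive
    similar confidence scores, i.e. |f(x) - f(x')| <= L * d(x, x')."""
    return abs(score_i - score_j) <= L * dist
```

A joint prompt would then be built for each pair returned by `pair_similar`, and the model's per-option scores could be audited with `lipschitz_consistent` before accepting its answers.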
